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Abstract 

In high-dimcnsional model selection problems, penalized least-square approaches 
have been extensively used. This paper addresses the question of both robustness and 
efficiency of penalized model selection methods, and proposes a data-driven weighted 
linear combination of convex loss functions, together with weighted Li-penalty. It is 
completely data-adaptive and does not require prior knowledge of the error distribu- 
tion. The weighted Li-penalty is used both to ensure the convexity of the penalty 
term and to ameliorate the bias caused by the Li-penalty. In the setting with dimen- 
sionality much larger than the sample size, we establish a strong oracle property of the 
proposed method that possesses both the model selection consistency and estimation 
efficiency for the true non-zero coefficients. As specific examples, we introduce a robust 
method of composite L1-L2, and optimal composite quantile method and evaluate their 
performance in both simulated and real data examples. 
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1 Introduction 



Feature extraction and model selection are important for sparse high dimensional data 
analysis in many research areas such as genomics, genetics and machine learning. Motivated 
by the need of robust and efficient high dimensional model selection method, we introduce 
a new penalized quasi-likelihood estimation for linear model with high dimensionality of 
parameter space. 

Consider the estimation of the unknown parameter /3 in the linear regression model 

Y = X/3 + e, (1) 

where Y = (Yi, • • • , Yn)'^ is an n-vector of response, X = (Xi, • • • , X„)"^ is an n x p matrix 
of independent variables with Xf being its i-th row, f3 = (/3i, /3p)-^ is a p- vector of 
unknown parameters and e = (ei, ...,en)'^ is an n-vector of i.i.d. random errors with mean 
zero, independent of X. When the dimension p is high it is commonly assumed that only a 
small number of predictors actually contribute to the response vector Y, which leads to the 
sparsity pattern in the unknown parameters and thus makes variable selection crucial. In 
many applications such as genetic association studies and disease classifications using high- 
throughput data such as microarrays with gene-gene interactions, the number of variables 
p can be much larger than the sample size n. We will refer to such problem as ultrahigh- 
dimensional problem and model it by assuming logp = 0{n^) for some 5 G (0, 1). Following 
Fan and Lv (2010), we will refer to p as a non-polynomial order or NP-dimensionality for 
short. 

Popular approaches such as LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), 
adaptive LASSO (Zou, 2006) and elastic-net (Zou and Zhang, 2009) use penalized least- 
square regression: 

n p 

P = argmin^ {Y, - Xf /3)' + n ^^^(1/3,1). (2) 
1=1 j=i 

where pa(') is a specific penalty function. The quadratic loss is popular for its mathematical 
beauty but is not robust to non-normal errors and presence of outliers. Robust regressions 
such as the least absolute deviation and quantile regressions have recently been used in 
variable selection techniques when p is finite (Wu and Liu, 2009; Zou and Yuan, 2008; 
Li and Zhu, 2008). Other possible choices of robust loss functions include Huber's loss 
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(Huber, 1964), Tukey's bisquare, Hampel's psi, among others. Each of these loss functions 
performs well under a certain class of error distributions: quadratic loss is suitable for 
normal distributions, least absolute deviation is suitable for heavy-tail distributions and 
is the most efficient for double exponential distributions, Huber's loss performs well for 
contaminated normal distributions. However, none of them is universally better than all 
others. How to construct an adaptive loss function that is applicable to a large collection 
of error distributions? 

We propose a simple and yet effective quasi-likelihood function, which replaces the 
quadratic loss by a weighted linear combination of convex loss functions: 

K 

Pv, = ^WkPk, (3) 

k=l 

where pi,...,pK are convex loss functions and wi,...,wk are positive constants chosen 
to minimize the asymptotic variance of the resulting estimator. From the point of view 
of nonparametric statistics, the functions {pi,--- ,pk} can be viewed as a set of basis 
functions, not necessarily orthogonal, used to approximate the unknown log-likelihood 
function of the error distribution. When the set of loss functions is large, the quasi- 
likelihood function can well approximate the log-likelihood function and therefore yield 
a nearly efficient method. This kind of ideas appeared already in traditional statistical 
inference with finite dimensionality (Koenker, 1984; Bai et al., 1992). We will extend it to 
the sparse statistical inference with NP-dimensionality. 

The quasi-likelihood function (3) can be directly used together with any penalty func- 
tion such as Lp-penalty with < p < 1 (Frank and Friedman, 1993), LASSO i.e. Li- 
penalty (Tibshirani, 1996), SCAD (Fan and Li, 2001), hierarchical penalty (Bickel et al., 
2008), resulting in the penalized composite quasi-likelihood problem: 

n p 

mmY,P^{Yi - Xf/3) + n J]p,(|/3,|). (4) 

i=i j=i 

Instead of using folded-concave penalty functions, we use the weighted Li- penalty of the 
form 

nE7A(|/3f 1)1/3,1 
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for some function 7^ and initial estimator /3' to ameliorate the bias in Li-penalization 
(Fan and Li, 2001; Zou, 2006; Fan and Lv, 2010) and to maintain the convexity of the 
problem. This leads to the following convex optimization problem: 

n p 
X = argminj^pw (y, - Xf/3) +nY,lx{Wfm\ (5) 
i=i j=i 

When 7a(') = p'xi')^ the derivative of the penalty function, (5) can be regarded as the local 
linear approximation to problem (4) (Zou and Li, 2008). In particular, LASSO (Tibshirani, 
1996) corresponds to 7a (a^) = A, SCAD reduces to (Fan and Li, 2001) 

7a(x) = A{I(x < A) + > A)}, (6) 

and adaptive LASSO (Zou, 2006) takes 'y\{x) = A|a;|~" where a > 0. 

There is a rich literature in establishing the oracle property for penalized regression 
methods, mostly for large but fixed p (Fan and Li, 2001; Zou, 2006; Yuan and Lin, 2007; 
Zou and Yuan, 2008). One of the early papers on diverging p is the work by Fan and Peng 
(2004) under conditions of p = 0(n^/^). More recent works of the similar kind include 
Huang et al. (2008), Zou and Zhang (2009), Xie and Huang (2009), which assume that the 
number of non-sparse elements s is finite. When the dimensionality p is of polynomial 
order, Kim et al. (2008) recently gave the conditions under which the SCAD estimator is 
an oracle estimator. We would like to further address this problem when logp = 0{n^) 
with 5 G (0,1) and s = 0{n"'^) for G (0,1), that is when the dimensionality is of 
exponential order. 

The paper is organized as follows. Section 2 introduces an easy to implement two- 
step computation procedure. Section 3 proves the strong oracle property of the weighted 
Li-penalized quasi-likelihood approach with discussion on the choice of weights and cor- 
rections for convexity. Section 4 defines two specific instances of the proposed approach 
and compares their asymptotic efficiencies. Section 5 provides a comprehensive simulation 
study as well as a real data example of the SNP selection for the Down syndrome. Section 
6 is devoted to the discussion. To facilitate the readability, all the proofs are relegated to 
the Appendices A, B & C. 
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2 Penalized adaptive composite quasi-likelihood 



We would like to describe the proposed two-step adaptive computation procedure and defer 
the justification of the appropriate choice of the weight vector w to Section 3. 

In the first step, one will get the initial estimate ^ using the LASSO procedure, i.e: 

n p 

= argmin^ (y, - Xf/3)' + nA |/3,|. 
i=i j=i 

and estimate the residual vector = Y — ^ (for justification see discussion following 
Condition 2). The matrix M and vector a are calculated as follows: 

^ n 1 " 

Mfci = -V^fe(e°M(e?), and a,. = - V {k,l = 1, K), 

n ^-^ n ^-^ 

i=l i=l 

where ipk{t) is a choice of the subgradient of pk{t), is the i-th component of e^, and 
Ofc should be considered as a consistent estimator of Edip^ie), which is the derivative of 
Eipk{e + c) at c = 0. For example, when ipkix) = sgn(x), then Eipi^{e + c) = 1 — 2Fs{—c) 
and Ed%l)k{E) = 2/^(0) . The optimal weight is then determined as 

Wopt = argmin^>o^aTw=iW^Mw. (7) 

In the second step, one calculates the quasi maximum likelihood estimator (QMLE) using 
weights Wopt as 

n p 

3" = argmin J^pw^,. {Y, - l^J 0) + nY,lx{\Pf\mV (8) 
^ i=i j=i 

Remark 1: Note that zero is not an absorbing state in the minimization problem (8). 
Those elements that are estimated as zero in the initial estimate (3^^'^ have a chance to 
escape from zero, whereas those nonvanishing elements can be estimated as zero in (8). 
Remark 2: The number of loss functions K is typically small or moderate in practice. 
Problem (7) can be easily solved using a quadratic programming algorithm. The result- 
ing vector MVopt can have vanishing components, automatically eliminating inefficient loss 
functions in the second step (8) and hence learning the best approximation of the unknown 
log-likelihood function. This can lead to considerable computational gains. See Section 4 
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for additional details. 

Remark 3: Problem (8) is a convex optimization problem when /9^.'s are all convex and 
7A(|/5j*^^|) are all nonnegative. This class of problems can be solved with fast and efficient 
computational algorithms such as pathwise coordinate optimization (Friedman et al., 2008) 
and least angle regression (Efron et al., 2004). 

One particular example is the combination of Li and L2 regressions, in which K = 2, 
Pi{t) = \t—bo\ and P2{t) = t^- Here 60 denotes the median of error distribution e. If the error 
distribution is symmetric, then 69 = 0. If the error distribution is completely unknown, 
60 is unknown and can be estimated from the residual vector {e^*} or being regarded as 
an additional parameter and optimized together with /3 in (8). Another example is the 
combination of multiple quantile check functions, that is, 

Pk{t) = Tk{t - bk)+ + (1 - Tk){t - bk)~, 

where G (0, 1) is a preselected quantile and bk is the r^-quantile of the error distribution. 
Again, when bk^s are unknown, they can be estimated using the sample quantiles Tk of the 
estimated residuals £^ or along with /3 in (8). See Section 4 for additional discussion. 

3 Sampling properties and their applications 

In this section, we plan to establish the sampling properties of estimator (5) under the 
assumption that the number of parameters (true dimensionality) p and the number of 
non-vanishing components (effective dimensionality) s = \\/3*\\o satisfy logp = 0{n^) and 
s = 0{n°"') for some 5 G (0,1) and G (0)1)- Particular focus will be given to the 
oracle property of Fan and Li (2001), but we will strengthen it and prove that estimator 
(5) is an oracle estimator with overwhelming probability. Fan and Lv (2010) were among 
the first to discuss the oracle properties with NP dimensionality using the full likelihood 
function in generalized linear models with a class of folded concave penalties. We work on 
a quasi-likelihood function and a class of weighted convex penalties. 

3.1 Asymptotic properties 

To facilitate presentation, we relegate technical conditions and the details of proofs to the 
Appendix. We consider more generally the weighted Li-penalized estimator with nonneg- 
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ative weights di, - ■ ■ ,dp. Let 

n p 

Ln{(3) = ^ Pw {Yi - y.J(3) + nXn dj |/?,- 1 (9) 
i=i j=i 

denote the penalized quasi-hkehhood function. The estimator in (5) is a particular case of 
(9) and corresponds to the case with dj = 7a(|/3]'^^|)/A„. 

Without loss of generality, assume that parameter /3* can be arranged in the form of 
P* = (/3*^,0'^)^, with PI € a vector of non- vanishing elements of /3*. Let us call 
P = {Pi , 0^)^ G the biased oracle estimator, where P^ is the minimizer of Ln{Pi,0) 
in R'^ and is the vector of all zeros in R^"'^. Here, we suppress the dependence of on 
w and d = {di, • • • , dp)'^ . The estimator P is called the biased oracle estimator, since the 
oracle knows the true submodel 7W* = {j : /?* / 0}, but nevertheless applies a penalized 
method to estimate the non-vanishing regression coefficients. The bias becomes negligible 
when the weights in the first part are zero or uniformly small (see Theorem 3.2). When 
the design matrix S is non-degenerate, the function L„(/3i,0) is strictly convex and the 
biased oracle estimator is unique, where S is a submatrix of X such that X = [S, Q] with 
S and Q being n x s and n x {p — s) sub-matrices of X, respectively. 

The following theorem shows that P is the unique minimizer of Ln{P) on the whole 
space with an overwhelming probability. As a consequence, P^ becomes the biased 
oracle. We establish the following theorem under conditions on the non-stochastic vector 
d (see Condition 2). It is also applicable to stochastic penalty weights as in (8); see the 
remark following Condition 2. 

Theorem 3.1 Under Conditions 1-4, the estimators P and P^ exist and are unique on 
a set with probability tending to one. Furthermore, 

P(P^ = P°) > I - (p - s) exp{-cn("«-2"i)++2"2} 

for a positive constant c. 

For the previous theorem to be nontrivial, we need to impose the dimensionality re- 
striction 5 < (ao — 2ai)+ -1- 2q2, where ai controls the rate of growth of the correlation 
coefficients between the matrices S and Q, the important predictors and unimportant 
predictors (see Condition 5) and 02 G [0,1/2) is a non-negative constant, related to the 
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maximum absolute value of the design matrix [see Condition 4]. It can be taken as zero 
and is introduced to deal with the situation where (ao ~ 2ai)_|_ is small or zero so that 
the result is trivial. The larger a2 is, the more stringent restriction is imposed on the 
choice of A„. When the above conditions hold, the penalized composite quasi-likelihood 



exponentially fast. 

Remark 4: The result of Theorem 3.1 is stronger than the oracle property defined in 
Fan and Li (2001) once the properties of are established (see Theorem 3.2). It was 
formulated by Kim et al. (2008) for the SCAD estimator with polynomial dimensional- 
ity p. It implies not only the model selection consistency and but also sign consistency 
(Zhao and Yu, 2006; Bickel et al., 2008, 2009): 



In this way, the result of Theorem 3.1 nicely unifies the two approaches in discussing the 
oracle property in high dimensional spaces. 

Let [3^1 and /3^2 the first s components and the remaining p — s components of 
respectively. According to Theorem 3.1, we have j3^2 — with probability tending to 
one. Hence, we only need to establish the properties of /S^i- 

Theorem 3.2 Under Conditions 1-5, the asymptotic bias of non-vanishing component 
/3-wi controlled by Dn = niax{dj : j G with 



estimator is equal to the biased oracle estimator P , with probability tending to one 



P(sgn(3^) = sgn(/3*)) = P(sgn(^°) = sgn(/3*)) ^ 1. 




Furthermore, when < qq < 2/3, /S^i possesses asymptotic normality: 




(10) 



where h is a unit vector in and 



.2 



(11) 



w 




2 



Since the dimensionality s depends on n, the asymptotic normality of is not well 
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defined in the conventional probability sense. The arbitrary linear combination b I3^i is 
used to overcome the technical difficulty. In particular, any finite component of (i^i is 
asymptotically normal. The result in Theorem 3.2 is also equivalent to the asymptotic 
normality of the linear combination B"^/3^x stated in Fan and Peng (2004), where B is a 
q X s matrix, for any given finite number q. 

This theorem relates to the results of Portnoy (1985) in classical setting (corresponding 
to p = s) where he established asymptotic normality of M-estimators when the dimension- 
ality is not higher than o(n^/^). 

3.2 Covariance Estimation 

The asymptotic normality (10) allows us to do statistical inference for non-vanishing com- 
ponents. This requires an estimate of the asymptotic covariance matrix of fi^i- Let 
£ = Y — S'^P^i be the residual and ii be its i-th component. A simple substitution 
estimator of cr^ is 

— . ^ N 2 • 

See also the remark proceeding (7). Consequently, by (10), the asymptotic variance- 
covariance matrix of (3^i is given by 

dl{s^sr\ (12) 

Another possible estimator of the variance and covariance matrix is to apply the standard 
sandwich formula. In Section 5, through simulation studies, we show that this formula has 
good properties for both p smaller and larger than n (see Tables 3 and 4 and comments at 
the end of Section 5.1). 

3.3 Choice of weights 

Note that only the factor o"^ in equation (11) depends on the choice of w and it is invariant 
to the scaling of w. Thus, the optimal choice of weights for maximizing efficiency of the 
estimator (3^i is 

^opt = argminw Mw s.t. a w = 1, w > (13) 
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where M and a are defined in Section 2 using an initial estimator, independent of the 
weighting scheme w. 

Remark 5: The quadratic optimization problem (13) does not have a closed form solution, 
but can easily be solved numerically for a moderate K. The above efficiency gain, over the 
least-squares, could be better understood from the likelihood point of view. Let f{t) denote 
the unknown error density. The most efficient loss function is the unknown log-likelihood 
function, — log f{t). But since we have no knowledge of it, the set Tk-, consisting of convex 
combinations of {/Ofc(-)}fcLi given in (3), could be viewed as a collection of basis functions 
used to approximate it. The broader the set J^k is, the better it can approximate the 
log-likelihood function and the more efficient the estimator ^ in (8) becomes. Therefore, 
we refer to as the quasi-likelihood function. 

3.4 One-step penalized estimator 

The restriction of w > guarantees the convexity of so that the problem (5) becomes 
a convex optimization problem. However, this restriction may cause substantial loss of 
efficiency in estimating (see Table 1). We propose a one-step penalized estimator 
to overcome this drawback while avoiding non-convex optimization. Let (3 be the esti- 
mator based on the convex combination of loss functions (5) and fii be its nonvanishing 
components. The one-step estimator is defined as 



1 -1 





(14) 



where 



n 



^n,w(;9l)=5]V'w(l^.-Sf;3l)S, 



n 



i=l 



Theorem 3.3 Under Conditions 1-5, if \\(3i — /i 

^ OS 

mator (3^ (14) enjoys the asymptotic normality: 




one-step esti- 




(15) 



provided that s = o{n 



/^), d^{-) is Lipchitz continous, and XmaxiY17=i W^Wi^i^I 



) = 0{n^s) 
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where Amax(') denote the maximum eigenvalue of a matrix and a% are defined as in The- 
orem 3.2. 

The one-step estimator (14) overcomes the convexity restriction and is always well 
defined, whereas (5) is not uniquely defined when convexity of p-^ is ruined. Note that if 
we remove the constraint oi Wk > ^ {k = 1,...,K), the optimal weight vector in (13) is 
equal to 

Wopt = M~ia and a^^^^ = (a^M-^a)-!. 

This can be significantly smaller than the optimal variance obtained with convexity con- 
straint, especially for multi-modal distributions (see Table 1). 

The above discussion prompts a further improvement of the penalized adaptive compos- 
ite quasi-likelihood in Section 2. Use (8) to compute the new residuals and new matrix M 
and vector a. Compute the optimal unconstrained weight y^opt = ]V[~^a and the one-step 
estimator (14). 

4 Examples 

In this section, we discuss two specific examples of penalized quasi-likelihood regression. 
The proposed methods are complementary, in the sense that the first one is computationally 
easy but loses some general fiexibility while the second one is computationally intensive 
but efficient in a broader class of error distributions. 

4.1 Penalized Composite L1-L2 regression 

First, we consider the combination of Li and L2 loss functions, that is, pi{t) = \t — 6o| and 
P2{t) = t^ . The nuisance parameter 60 is the median of the error distribution. Let ^ 
denote the corresponding penalized estimator as the solution to the minimization problem: 

n n p 

argmin wi ^ \Yi - bo - Xf/3| + (l^ - Xf/3)' + nj^lxilf^f m\- (16) 

^=l i=l j=l 
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If the error distribution is symmetric, then 60 = and the minimization problem (16) can 
be recast as a penahzed weighted least square regression 



argmm 



P 



E 1 + (y.-Xf/3)%nj:7A(|/3f 1)1/3, 



which can be efficiently solved by pathwise coordinate optimization (Friedman et al., 2008) 
or least angle regression (Efron et al., 2004). 

If 60 7^ 0, the penalized least-squares problem (16) is somewhat different from (5) 
since we have an additional parameter bo. Using the same arguments, and treating bo as 
an additional parameter for which we solve in (16), we can show that the conclusions of 
Theorems 3.2 and 3.3 hold with the asymptotic variance equal to 

2 / N wj/A + w^a"^ +W2W1B 
^L,-L2M = : 777-— ^2 > (17) 

where B = E[e{I{e > 69) ~ < ^0))] /(") is the density of e. This will hold when 69 
is either known or unknown. Explicit optimization of (17) is not trivial and we go through 
it as follows. 

Since ci^^/^j^^) invariant to the scale of w, by setting wi/w2 = ca, we have 

2 / ^ 2cV4 + l + a£C 

[beC+l) 

where as = B/a and feg = (7/(60) • Note that 

\B\ < E\£\ [I{e > bo) + I{e < bo)] < a. 

Hence, |ae| < 1 and c^/A + 1 + a^c = (c/2 + 0^)^ + 1 - > 0. 

The optimal value of c over [0,oo) can be easily computed. If asbs < 0.5, then the 
optimal value is obtained at 

Ce = 2(26, - a,)+/{l - 2aebe). (19) 

In particular, when 26, < a,, = 0, and the optimal choice is the least-squares estimator. 
When ai^be = 0.5, if 26, < a,, then the minimizer is = 0. In all other cases, the minimizer 
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is Ce = oo i.e. we are left to use Li regression alone. 

The above result shows the limitation of the convex combination, i.e. c > 0. In many 
cases, we are left alone with the least-squares or least absolute deviation regression without 
improving efficiency. The efficiency can be gained and achieved by allowing negative weights 
via the one-step technique as in Section 3.4. Let g{c) = {c^/i + 1 + aec)/{beC+ 1)'^. The 
function g{c) has a pole at c = —1/be and a unique critical point 

Copt = 2(26, - a,)/(l - 2a,b,), (20) 

provided that a^be ^ 1/2. Consequently, the function g(c) can not have any local maximizer 
(otherwise, from the local maximizer to the point c = — l/ftg, there must exist a local 
minimizer, which is also a critical point). Hence, the miniiTiuiri value is cLtta-incd. cit Copt- 

In 

other words, 

minal r{w) = a'^mmg{c) = dea^, (21) 

w ^ c 

where 

4 = g{copt) = (1 - a2)/(462 _ Aaebe + 1). (22) 

Since the denominator can be written as (a^ — 26^)^ + (1 — a^), we have < 1, namely, it 
outperforms the least-squares estimator, unless = 26^. Similarly, it can be shown that 

1 - gg J_ 

' ~ 462[l-a2 + (2a,-l/6,)2/4] " W 

namely, it outperforms the least absolute deviation estimation, unless a^be = 1/2. 

When error distribution is symmetric unimodal, b^ > l/^/l2, according to Chapter 
5 of Lehmann (1983). The worst scenario for the Li-regression in comparison with the 
L2-regression is the uniform distribution (see Chapter 5, Lehmann (1983)), which has 
the relatively efficiency of merely 1/3. For such uniform distribution, = \/3/2 and 
bg = I/VT2, de = 3/4, and Copt = — 2/\/3. Hence, the best L1-L2 is 4 times better than Li 
regression alone. More comparisons about the weighted L1-L2 combination with Li and 
least-squares are given in Table l(Section 4.3). 

4.2 Penalized Composite Quantile Regression 

The weighted composite quantile regression (CQR) was first studied by Koenker (1984) 
in classical statistical inference setting. Zou and Yuan (2008) used equally weighted CQR 
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(ECQR) for penalized model selection with p large but fixed. We show that the efficiency 
of ECQR can be substantially improved by properly weighting and extend the work to 
the case of p ^ n. Consider K different quantiles, < ri < T2 < ... < tr- < 1. Let 

Pk{t) = Tk{t ~ bk)+ + (1 — Tfc)(t — The penalized composite quantile regression 

/V cqr 

estimator /3 is defined as the solution to the minimization problem 

K n p 

arg min V^«,J]pfc(y,-Xf/3)+nJ]7A(|/3f 1)1/3,1, (23) 

fc=i i=i j=i 

where 6^ is the estimator of the nuisance parameter 6^ = F~^(ti^), the r^-th quantile of 
the error distribution. Note that 61, • • • ,bK are nuisance parameters and the minimization 
at (23) is done with respect to them too. After some algebra we can confirm that the 
conclusions of Theorems 3.2 and 3.3 continue to hold with the asymptotic variance as 

f^cqr(w) = ; 7^ . (24) 



As shown in Koenker (1984) and Bickel (1973), when K — )• 00, the optimally weighted 
CQR (WCQR) is as efficient as the maximum likelihood estimator, always more efficient 
than ECQR. Computationally, the minimization problem in equation (23) can be casted 
as a large scale linear programming problem by expanding the covariate space with new 
ancillary variables. Thus, it is computationally intensive to use too many quantiles. In 
Section 4.3, we can see that usually no more than ten quantiles are adequate for WCQR 
to approach the efficiency of MLE, whereas determining the optimal value of K in ECQR 
seems difficult since the efficiency is not necessarily an increasing function of K (Table 2). 
Also, some of the weights in Wopt are zero, hence making WCQR method computationally 
less intensive than ECQR. From our experience in large p and small n situations, this 
reduction tends to be significant. 

The optimal convex combination of quantile regression uses the weight 

Wopt = argmin^>o,aTw=iW^Mw, (25) 

where a = [f[F^^{Ti)), • • • , f {F~^ {tk)))'^ and 'Wl'is a. K x K matrix whose (i, j)-element 
is min(rj,rj) — TiTj. The optimal combination of quantile regression, which is obtained by 



14 



using the one-step procedure, uses the weight 

v^opt = M-^a. (26) 

Clearly, both combinations improve the efficiency of ECQR and the optimal combination 
is most efficient among the three (see Table 1). When the error distributions are skewed 
or multimodal, the improvement can be substantial. 

4.3 Asymptotic Efficiency Comparison 

In this section, we studied the asymptotic efficiency of proposed estimators under several 
error distributions. For comparison, we also included Li regression, L2 regression and 
ECQR. The error distribution ranges from the symmetric to asymmetric distributions: 
double exponential (DE), t distribution with degree of freedoms 4 (^4), normal A/'(0, 1), 
Gamma r(3, 1), Beta B (3, 5), a scale mixture of normals (MN^) 0.1iV(0, 25)+0.9iV(0, 1) and 
a location mixture of normals (MN;) 0.7A^( — 1, 1) + 0.3A^(7/3, 1). To keep the comparison 
fair and to satisfy the first assumption of mean zero error terms, we first centered the error 
distribution to have mean zero. 

Table 1 shows the asymptotic relative efficiency of each estimator compared to MLE. 
L1-L2 and L1-L2 indicate the optimal convex L1-L2 combination and optimal L1-L2 com- 
bination, respectively. While Li regression can have higher or lower efficiency than L2 
regression in different error distributions, Li-L^ and L1-L2 regressions are consistently 
more efficient than both of them. WCQR"*" denote the optimal convex combination of 
multiple quantile regressions and WCQR represent the optimal combination. In all quan- 
tile regressions, quantiles {j^^, j^^) were used. As shown in Table 1, WCQR"*" and 
WCQR always outperform ECQR and the differences are more significant in double ex- 
ponential distribution and asymmetric distributions such as Gamma and Beta. In DE, t4 
and AA(0, 1), nine quantiles are usually adequate for WCQR^ and WCQR to achieve full 
efficiency. In r(3, 1) and B{3, 5), they need 29 quantiles to achieve efficiency close to MLE 
while the other estimators are significantly inefficient. This difference is most expressed 
in multimodal distributions, MN^ and MN^, with WCQR outperforming all. One of the 
possible problems with ECQR is that the efficiency does not necessarily increase with K, 
making the choice of K harder. For example, for the double exponential distribution, the 
relative efficiency decreases with K. This is understandable, as i^T = 1 is optimal: Putting 
more and odd number of quantiles dilutes the weights. 
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Table 1: Asymptotic relative efficiency compared to MLE 



/(e) 




DE 


ti 


A/'(0, 1) 


r(3,i) 


^(3,5) 


MN, 


MN/ 




Li 


1.00 


0.80 


0.63 


0.29 


0.41 


0.61 


0.35 




L2 


0.50 


0.35 


1.00 


0.13 


0.68 


0.05 


0.14 




L1-L2 


i.oon 


0.85 


1.00 


0.34 


0.68 


0.61 


0.63 




L1-L2 


1.00'' 


0.85 


1.00 


0.44 


0.80 


0.61 


0.63 


ECQR 


K = 3 


0.84 


0.94 


0.86 


0.43 


0.59 


0.76 


0.44 




5 


0.83 


0.97 


0.89 


0.47 


0.65 


0.78 


0.50 




9 


0.82 


0.97 


0.92 


0.49 


0.68 


0.77 


0.52 




19 


0.82 


0.97 


0.94 


0.50 


0.69 


0.75 


0.54 




29 


().R:^ 


(197 


n.95 


0.51 


0.71 


0.76 


n,54 


WCQR+ 


K = 3 


0.951" 


0.94 


0.87 


0.51 


0.61 


0.76 


0.60 




5 


0.96 


0.97 


0.91 


0.59 


0.70 


0.78 


0.69 




9 


0.97 


0.98 


0.95 


0.69 


0.78 


0.79 


0.77 




19 


0.98 


0.99 


0.98 


0.80 


0.86 


0.80 


0.83 




29 


0.99 


0.99 


0.99 


0.85 


0.90 


0.80 


0.84 


WCQR 


K = 3 


0.95^ 


0.94 


0.87 


0.51 


0.61 


0.76 


0.61 




5 


0.96 


0.97 


0.91 


0.60 


0.72 


0.78 


0.76 




9 


0.98 


0.98 


0.95 


0.70 


0.80 


0.79 


0.88 




19 


0.99 


0.99 


0.98 


0.81 


0.88 


0.92 


0.95 




29 


0.99 


0.99 


0.99 


0.86 


0.92 


0.93 


0.97 
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Table 2: Optimal weights of convex composite quantile regression with K = 9 quantiles 







-f- 

14 


JV [U, L) 




o{6,b) 


i\/ri\T 

iViiN g 


i\/ri\T, 

iViiN ; 


Quantile 
















i/iU 





r\ 

u 


0.20 


O.OD 


0.39 


0.06 


0.36 


2/10 





0.12 


0.11 


0.15 


0.10 


0.23 


0.11 


3/10 





0.14 


0.09 


0.08 


0.11 


0.17 


0.10 


4/10 





0.14 


0.08 


0.06 


0.05 


0.10 


0.01 


5/10 


1 


0.16 


0.06 


0.05 





0.14 





6/10 





0.14 


0.08 


0.04 











7/10 





0.14 


0.09 





0.05 








8/10 





0.12 


0.11 





0.09 








9/10 








0.20 


0.05 


0.20 


0.30 


0.29 



In Table 2 we illustrate both the adaptivity of the proposed composite QMLE method- 
ology and computational efficiency of WCQR^ over ECQR by showing the positions of 
zero of the optimal nonnegative weight vector w^j. For K = 9, only 1 quantile is needed 
in the DE case, 5 and 6 quantiles are needed for MN/ and MNg and 7 quantiles for t^, 
Gamma and Beta. Only in the normal distribution, all 9 quantiles are used. Therefore, 
WCQR"^ can dramatically reduce the computational complexity of ECQR in large scale 
optimization problems where n. 

5 Finite Sample Study 
5.1 Simulated example 

In the simulation study, we consider the classical linear model for testing variable selection 
methods used by Fan and Li (2001) 

y = x^/3* + e, x~iV(0,5],), (Sx)^^ = (0.5)l*-^'l. 

The error vector varies from uni- to multi-modal and heavy to light tails distributions in 
the same way as in Tables 1 and 2, and is centered to have mean zero. The data has 
n = 100 observations. We considered two settings where p = 12 and p = 500, respec- 
tively. In both settings, /32, /^s) = (3,1.5,2) and the other coefficients are equal to 
zero. We implemented penalized Li, L2, composite Li-L^, L1-L2, ECQR, WCQR"*" and 
WCQR using quantiles (10%,20%,--- ,90%). The local linear approximation of SCAD 
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penalty (6) was used and the tuning parameter in the penalty was selected using five fold 
cross validation. We compared different methods by: (1) model error, which is defined as 
ME(3) = 0- (3*)^ E{X'^ X)0- P*); (2) the number of correctly classified non-zero coeffi- 
cients , i.e. the true positive (TP); (3) the number of incorrectly classified zero coefficients, 
i.e. the false positive (FP); (4) the multiplier (Jw of the standard error (SE)(12). A total 
of 100 replications were performed and the median of model error (MME), the average of 
TP and FP are reported in Table 3. The median model errors of oracle estimators were 
calculated as the benchmark for comparison. 

From the results presented in Table 3 and Table 4, we can see that penalized composite 
L1-L2 regression takes the smaller of the two model errors of Li and L2 in all distributions 
except in B{3,5) where it outperforms both. As expected, optimal L1-L2 outperforms 
L1-L2 and brings a smaller number of FP, especially in multimodal and unsymmetric 
distributions. Also, both Li-L^ and L1-L2 perform reasonably well when compared to 
ECQR, but with much less computational burden. WCQR'^ and WCQR in both Tables 3 
and 4 have smaller model errors and smaller number of false positives than ECQR. Similar 
conclusions can be made from Figure 1, which compares the boxplots of the model errors 
of the five methods (WCQR"*" and Li — are not shown) under different distributions in 
the case of n = 100, p = 500. For p <^ n in Table 3 we didn't include LASSO estimator 
since it behaves reasonably well in that setting. For p ^ n in Table 4, we included LASSO 
estimator as a reference. Table 4 shows that LASSO has bigger model errors, more false 
positives and higher standard errors (usually by a factor of 10) than any other five SCAD 
based methods discussed. 

In addition to the ME in Tables 3 and 4, we reported the multiplier it^ of the asymptotic 
variance (see equation (12)). Being the only part of SE that depends on the choice of 
weights w and loss functions pk, we explored it's behavior when the dimensionality grows 
from p <^ n to p ^ n. Both Tables 3 and 4 confirm the stability of the formula throughout 
the two settings and all five CQMLE methods. Only Lasso estimator being unable to specify 
the correct sparsity set when p ^ n, inflates for one order of magnitude compared to 
other CQMLEs. Note that WCQR"*" keeps the smallest value of ct^ and all Li-L2,Li-L2 , 
WCQR and WCQR^ have smaller SEs than the classical Li, L2 or ECQR methods. 
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Table 3: Simulation results (n = 100, p = 12) where f, J represent Median model error 
(MME) of the oracle and penalized estimator respectively 



fie) 


DE 


U 


AA(0,3) 


r(3,i) 


6(3,5) 




Li 


Oracle 


0.029f 


0.050 


0.122 


0.082 


0.0010 


0.043 




Penalized 


0.035^ 


0.053 


0.128 


0.097 


0.0011 


0.051 




(TP, FP) 


(3,1.83) 


(3,0.8) 


(3,0.84) 


(3,1) 


(3,0.54) 


(3,0.93) 




SDxlO^ 


0.646 


0.767 


0.570 


0.950 


0.112 


0.244 




Oracle 


0.047 


0.043 


0.073 


0.064 


0.00056 


0.083 




Penalized 


0.059 


0.054 


0.106 


0.100 


0.0011 


0.091 




(TP, FP) 


(3,0.82) 


(3,1.61) 


(3,1.89) 


(3,1.35) 


(3,3.76) 


(3,1.47) 




SDxlO^ 


0.779 


0.762 


0.485 


0.869 


0.129 


0.179 


L1-L2 


Oracle 


0.036 


0.043 


0.070 


0.070 


0.00061 


0.051 




Penalized 


0.037 


0.049 


0.102 


0.099 


0.00077 


0.058 




(TP, FP) 


(3,2.49) 


(3,2.39) 


(3,1.97) 


(3,2.09) 


(3,2.42) 


(3,2.69) 




SDxlOi 


0.717 


0.702 


0.518 


0.876 


0.095 


0.169 


L1-L2 


Oracle 


0.036 


0.043 


0.070 


0.063 


0.00060 


0.051 




Penalized 


0.037 


0.049 


0.102 


0.078 


0.00063 


0.058 




(TP, FP) 


(3,2.49) 


(3,2.39) 


(3,1.97) 


(3,2.05) 


(3,2.42) 


(3,2.69) 




SDxlO^ 


0.717 


0.702 


0.518 


0.846 


0.075 


0.169 


ECQR 


Oracle 


0.031 


0.046 


0.069 


0.063 


0.00065 


0.033 




Penalized 


0.042 


0.046 


0.107 


0.074 


0.00091 


0.040 




(TP, FP) 


(3,1.88) 


(3,1.57) 


(3,2.04) 


(3,1.83) 


(3,1.88) 


(3,1.38) 




SDxlO^ 


0.654 


0.562 


0.488 


0.813 


0.087 


0.177 


WCQR+ 


Oracle 


0.033 


0.047 


0.068 


0.052 


0.00065 


0.036 




Penalized 


0.039 


0.041 


0.100 


0.054 


0.00070 


0.037 




(TP, FP) 


(3,0.55) 


(3,1.47) 


(3, 0.74) 


(3,0.61) 


(3,0.98) 


(3,0.62) 




SDxlO^ 


0.440 


0.612 


0.498 


0.715 


0.071 


0.174 


WCQR 


Oracle 


0.033 


0.047 


0.068 


0.048 


0.00058 


0.028 




Penalized 


0.039 


0.041 


0.100 


0.050 


0.00062 


0.030 




(TP, FP) 


(3,0.55) 


(3,1.47) 


(3, 0.74) 


(3,0.61) 


(3,0.98) 


(3,0.62) 




SDxlO^ 


0.440 


0.612 


0.498 


0.650 


0.061 


0.132 
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Table 4: Simulation results (n = 100, p = 500) were f, f are Median model error (MME) 
of oracle and penalized estimator respectively 



fie) 


DE 




AA(0,3) 


r(3,i) 


fi(3,5) 


MN, 


Lasso 


Oracle 
Penalized 
(TP,FP) 
SDxlO^ 


O.O39T 
1.775^ 
(3,94.46) 
3.336 


0.039 
1.759 
(3,94.26) 
3.257 


0.035 
8.687 
(3,96.80) 
0.578 


0.0719 
2.662 
(3,95.59) 
3.167 


0.062 
1.808 
(3,86.88) 
0.989 


0.176 
6.497 
(3,96.55) 
0.539 




Oracle 
Penalized 
(TP,FP) 
SDxlO^ 


0.025 
0.035 
(3,4.53) 
0.268 


0.031 
0.039 
(3,4.47) 
0.274 


0.382 
1.342 
(3,5.32) 
0.144 


0.096 
0.131 
(3,4.56) 
0.461 


0.0094 
0.0120 
(3,8.10) 
0.215 


0.281 
0.514 
(3,4.58) 
0.101 


L2 


Oracle 
Penalized 
(TP,FP) 
SDxlQi 


0.035 
0.093 
(3,12.31) 
0.865 


0.043 
0.086 
(3,10.64) 
0.828 


0.207 
1.187 
(3,11.00) 
0.281 


0.078 
0.175 
(3,8.02) 
0.168 


0.0057 
0.0073 
(3,18.75) 
0.396 


0.187 
0.764 
(3,16.93) 
0.238 


Li-L^ 


Oracle 
Penalized 
(TP,FP) 
SDxlO^ 


0.193 
0.036 
(3,17.92) 
0.226 


0.035 
0.036 
(3,12.58) 
0.235 


0.224 
1.160 
(3,15.87) 
0.396 


0.080 
0.097 
(3,15.43) 
0.144 


0.0061 
0.0077 
(3,14.05) 
0.235 


0.195 
0.576 
(3,17.92) 
0.207 


L1-L2 


Oracle 
Penalized 
(TP,FP) 
SDxlO^ 


0.035 
0.036 
(3,17.92) 
0.226 


0.035 
0.036 
(3,12.58) 
0.235 


0.224 
1.160 
(3,15.87) 
0.905 


0.079 
0.095 
(3,15.43) 
0.150 


0.0050 
0.0069 
(3,14.05) 
0.190 


0.195 
0.576 
(3,17.92) 
0.207 


ECQR 


Oracle 
Penalized 
(TP,FP) 
SDxlO^ 


0.029 
0.060 
(3,8.71) 
0.469 


0.024 
0.070 
(3,8.43) 
0.475 


0.252 
0.764 
(3,7.78) 
0.153 


0.057 
0.148 
(3,9.59) 
0.716 


0.0064 
0.0118 
(3,9.69) 
0.213 


0.207 
0.599 
(3,8.91) 
0.139 


WCQR+ 


Oracle 
Penalized 
(TP,FP) 
SDxlQi 


0.028 
0.045 
(3,3.97) 
0.244 


0.027 
0.037 
(3,3.76) 
0.266 


0.223 
0.595 
(3, 3.93) 
0.112 


0.050 
0.079 
(3,3.66) 
0.273 


0.0066 
0.0076 
(3,4.85) 
0.120 


0.204 
0.368 
(3,4.05) 
0.084 


WCQR 


Oracle 
Penalized 
(TP,FP) 
SDxlO^ 


0.028 
0.045 
(3,3.97) 
0.224 


0.027 
0.037 
(3,3.76) 
0.219 


0.223 
0.595 
(3,3.93) 
0.112 


0.048 
0.062 
(3,3.66) 
0.180 


0.0048 
0.0060 
(3,4.85) 
0.110 


0.160 
0.280 
(3,4.05) 
0.060 



20 



Double Exponential 



T4 



Normal 




Figure 1: Boxplots of Median model error (MME) of Li, L2, L1-L2, ECQR and WCQR 
methods under different distributional settings with n = 100,p = 500 

5.2 Real Data Example 

In this section, we apphed proposed methods to expression quantitative trait locus (eQTL) 
mapping. Variations in gene expression levels may be related to phenotypic variations such 
as susceptibility to diseases and response to drugs. Therefore, to understand the genetic 
basis of gene expression, variation is an important topic in genetics. The availability of 
genome- wide single nucleotide polymorphism (SNP) measurement has made it possible and 
reasonable to perform the high resolution eQTL mapping on the scale of nucleotides. In 
our analysis, we conducted the cis-eQTL mapping for the gene CCT8. This gene is located 
within the Down Syndrome Critical Region on human chromosome 21, on the minus strand. 
The over expression of CCT8 may be associated with Down syndrome phenotypes. 

We used the SNP genotype data and gene expression data for the 210 unrelated indi- 
viduals of the International HapMap project (International HapMap Consortium, 2003) , 
which include 45 Japanese in Tokyo, Japan, 45 Han Chinese in Beijing, China, 60 Utah 
parents with ancestry from northern and western Europe (CEPH) and 60 Yoruba par- 
ents in Ibadan, Nigeria and they are available in PLINK format (Purcell, et al 2007) 
[http://pngu.mgh.harvard.edu/purcell/plink/]. We included in the analysis more than 2 
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million SNPs with minor allele frequency greater than 1% and missing data rate less than 
5%. The gene expression data were generated by Illumina Sentrix Human-6 Expression 
BeadChip and have been normalized (ith quantile normalization across replicates and me- 
dian normalization across individuals) independently for each population (Stranger, et al 
2007) [ftp:/ /ftp. Sanger. ac.uk/pub/genevar/]. 

Specifically, we considered the cis-candidate region to start 1 Mb upstream of the tran- 
scription start site (TSS) of CCT8 and to end 1 Mb downstream of the transcription end 
site (TES), which includes 1955 SNPs in Japanese and Chinese, 1978 SNPs in CEPH and 
2146 SNPs in Yoruba. In the following analysis, we grouped Japanese and Chinese together 
into the Asian population and analyzed the three populations Asian, CEPH and Yoruba 
separately. The additive coding of SNPs (e.g. 0,1,2) was adopted and was treated as cate- 
gorical variables instead of continuous ones to allow non-additive effects, i.e., two dummy 
variables will be created for categories 1 and 2 respectively. The category represents 
the major, normal population. The missing SNP measurements were imputed as O's. The 
response variable is the gene expression level of gene CCT8, measured by microarray. 

In the first step, the ANOVA F-statistic was computed for each SNP independently 
and a version of independent screening method of Fan and Lv (2008) was implemented. 
This method is particularly computationally efficient in ultra-high dimensional problems 
and here we retained the top 100 SNPs with the largest F-statistics. In the second step, 
we applied to the screened data the penalized L2, Li, Li-L^, L1-L2, ECQR, WCQR"*" and 
WCQR with local linear approximation of SCAD penaly. All the four composite quantile 
regressions used quantiles at (10%, 90%). LASSO was used as the initial estimator and 
the tuning parameter in both LASSO and SCAD penalty was chosen by five fold cross 
validation. In all the three populations, the L1-L2 and Li-L^ regressions reduced to L2 
regression. This is not unexpected due to the gene expression normalization procedure. In 
addition, WCQR reduced to WCQR"*". The selected SNPs, their coefficients and distances 
from transcription starting site (TSS) are summarized in Tables 5, 6 and 7. 

In Asian population (Table 5), the five methods are reasonably consistent in not only 
variables selection but also coefficients estimation (in terms of signs and order of magni- 
tude). WCQR uses the weights (0.19,0.11,0.02,0,0.12,0.09,0.18, 0.19,0.10). There are 
four SNPs chosen by all the five methods. Two of them, rs2832159 and rs2245431, up- 
regulate gene expression while rs9981984 and rsl6981663 down-regulate gene expression. 
The ECQR selects the largest set of SNPs while Li regression selects the smallest set. In 
CEPH population (Table 6), the five methods consistently selected the same seven SNPs 
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Table 5: eQTLs for gene CCT8 in Japanese and Chinese (n = 90). ** is the indicator for 
SNP equal to 2 and otherwise is the indicator for 1. SE of the estimates is reported in the 
parenthesis. 



SNP 




L2 


1 2 


Li 

± 


ECQR 


WCQR+ 


Distance from 








L1-L2 






WCQR 


TSS (kb) 


rsl6981663 


-0.11 


(0.03) 


-0.11 (0.03) 


-0.09 (0.04) 


-0.10 (0.03) 


-0.09 (0.03) 


-998 


rsl6981663** 


0.08 


(0.06) 


0.08 (0.06) 




0.04 (0.06) 




-998 


rs9981984 


-0.12 


(0.03) 


-0.12 (0.03) 


-0.10 (0.04) 


-0.09 (0.03) 


-0.12 (0.03) 


-950 


rs7282280 










0.05 (0.03) 




-231 


rs7282280** 










-0.07 (0.05) 




-231 


rs2245431** 


0.33 


(0.10) 


0.33 (0.10) 


0.36 (0.11) 


0.37 (0.09) 


0.38 (0.10) 


-89 


rs2832159 


0.21 


(0.04) 


0.21 (0.04) 


0.30 (0.04) 


0.20 (0.04) 


0.23 (0.04) 


13 


rsl999321** 


0.11 


(0.07) 


0.11 (0.07) 




0.14 (0.07) 




84 


rs2832224 


0.07 


(0.03) 


0.07 (0.03) 




0.06 (0.03) 


0.04 (0.03) 


86 



with only ECQR choosing two additional SNPs. WCQR uses the weight (0.19, 0.21, 0, 0.04, 
0.03,0.07,0.1,0.21,0.15). The coefficient estimations were also highly consistent. Deutsch 
et al (2007) performed a similar cis-eQTL mapping for the gene CCT8 using the same 
CEPH data as here. They considered a lOOkb region surrounding the gene, which contains 
41 SNPs. Using ANOVA with correction for multiple tests, they identified four eQTLs, 
rs965951, rs2832159, rs8133819 and rs2832160, among which rs965951 possessing the small- 
est p-value. Our analysis verified rs965951 to be an eQTL but did not find the other SNPs 
to be associated with the gene expression of CCT8. In other words, conditioning on the 
presence of SNP rs965951 the other three make little additional contributions. The anal- 
ysis of Yoruba population yields a large number of eQTLs (Table 7). The ECQR again 
selects the largest set of 44 eQTLs. The Li regression selects 38 eQTLs. The L2 regres- 
sion and WCQR both select 27 SNPs, 26 of which are the same. WCQR uses the weight 
(0.1,0,0.17,0.16,0.11,0.3,0,0,0.16). The coefficients estimated by different methods are 
mostly consistent (in terms of signs and order of magnitude), except that the coefficients 
estimates for rs8134601, rs7281691, rs6516887 and rs2832159 by ECQR and Li have dif- 
ferent signs from those of L2 and WCQR. The eQTLs are almost all located within 500kb 
upstream TSS or 500kb downstream TES (Figure 2) and mostly from lOOkb upstream TSS 
to 350 kb downstream TES. 
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Table 6: eQTLs for gene CCT8 in CEPH (n = 60). ** is the indicator for SNP equal to 2 



and otherwise is the indicator for 1. SE of the estimates is reported in the parenthesis. 



SNP 




L2 


L1-L2 






ECQR 


WCQR+ 
WCQR 


Distance from 
TSS (kb) 


rs2831459 


0.20 


(0.07) 


0.20 (0.07) 


0.19 


(0.08) 


0.17 (0.07) 


0.18 (0.07) 


-999 


rs7277536 


0.18 


(0.09) 


0.18 (0.09) 


0.09 


(0.11) 


0.14 (0.09) 


0.23 (0.09) 


-672 


rs7278456** 


0.36 


(0.11) 


0.36 (0.11) 


0.21 


(0.13) 


0.40 (0.11) 


0.35 (0.11) 


-663 


rs2248610 


0.08 


(0.04) 


0.08 (0.04) 


0.09 


(0.05) 


0.10 (0.05) 


0.06 (0.05) 


-169 


rs965951 


0.11 


(0.05) 


0.11 (0.05) 


0.13 


(0.06) 


0.03 (0.06) 


0.12 (0.05) 


-13 


rs3787662 


0.12 


(0.06) 


0.12 (0.06) 


0.08 


(0.07) 


0.13 (0.06) 


0.12 (0.06) 


78 


rs2832253 
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Figure 2: Chromosome locations of identified eQTLs of the gene CCT8 with grey region 
as the CCT8's coding region. The eQTLs selected by any of the five methods are shown. 
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Table 7: eQTLs of gene CCT8 in Yoruba (n = 60); ** is the indicator for SNP equal to 2 



and otherwise is the indicator for 1. SE of the estimates is reported in the parenthesis. 
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6 Discussion 



In this paper, a robust and efficient penalized quasi-Ukelihood approach is introduced for 
model selection with NP-dimensionality. It is shown that such an adaptive learning tech- 
nique has a strong oracle property. As specific examples, two complementary methods 
of penalized composite L1-L2 regression and weighted composite quantile regression are 
introduced and they are shown to possess good efficiency and model selection consistency 
in ultrahigh dimensional space. Numerical studies show that our method is adaptive to un- 
known error distributions and outperforms LASSO (Tibshirani, 1996) and equally weighted 
composite quantile regression (Zou and Yuan, 2008). 

The penalized composite quasi-likelihood method can also be used in sure indepen- 
dence screening (Fan and Lv, 2008; Fan and Song, 2010) or iterated version (Fan, et al, 
2009), resulting in a robust variable screening and selection. In this case, the marginal 
regression coefficients or contributions will be ranked and thresholded (Fan and Lv, 2008; 
Fan and Song, 2010). It can also be applied to the aggregation problems of classification 
(Bickel et al., 2009) where the usual L2 risk function could be replaced with composite 
quasi-likelihood function. The idea can also be used to choose the loss functions in ma- 
chine learning. For example, one can adaptively combine the hinge-loss function in the 
support vector machine, the exponential loss in the AdaBoost, and the logistic loss func- 
tion in logistic regression to yield a more efficient classifier. 

Appendix A: Regularity Conditions 

Let Dk be the set of discontinuity points of i^kit), which is a subgradient of pk- Assume that 
the distribution of error terms is smooth enough so that F^{yj^^^Dh) = 0. Additional 
regularity conditions on V'fc are needed, as in Bai et al. (1992). 

Condition 1 The function ipk satisfies i?[^fc(ei -|- c)] = afcC -|- o(|c|) as \c\ — )■ 0, for some 
Ofc > 0. For sufficiently small \c\, gki{c) = E[{Tpk{ei + c) - V'fc(ei))('0«(ei + c) - tpi{£i))] 
exists and is continuous at c = 0, where k,l = 1, ...,K. The error distribution satisfies the 
following Cramer condition: E \Tp^{ei)\"^ < mlRK"'"^ , for some constants R and K. 

This condition implies that Eipki^i) = 0, which is an unbiased score function of pa- 
rameter (3. It also implies that Ed^ki^i) = f^fc exists. The following two conditions are 
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important for establishing sparsity properties of parameter (3^ by controlling the penalty 
weighting scheme d and the regularization parameter A„. 

Condition 2 Assume that Dn = max{(ij : j G M*] = o(n°i-°o/2) andX^Dn = 0(n-(^+"o)/2)_ 
In addition, liminf minjdj : j G A^*} > 0. 

The first statement is to ensure that the bias term in Theorem 3.2 is negligible. It 
is needed to control the bias due to the convex penalty. The second requirement is to 
make sure that the weights d in the second part are uniformly large so that the vanishing 
coefficients are estimated as zero. It can also be regarded as a normalization condition, 
since the actual weights in the penalty are {A„,(ij}. 

The LASSO estimator will not satisfy the first requirement of Condition 2 unless 
A„ is small and ai > ao/2. Nevertheless, under the sparse representation condition 
(Zhao and Yu, 2006), Fan and Lv (2010) show that with probability tending to one, the 
LASSO estimator is model selection consistent with ||/3^ — /SJHoo = 0{n~^ \ogn)^ when 
the minimum signal /3* = min{|/3*|,j G A^*} > n^'^logn. They also show that the same 
result holds for the SCAD-type estimators under weaker conditions. Using one of them 
as the initial estimator, the weight dj = 7A(/3j)/A in (8) would satisfy Condition 2, on 
a set with probability tending to one. This is due to the fact that with 7a(') given by 
(6), for j E Ml, dj = 7a(0)/A = 1, whereas for j G M*, dj < 7a(/3^/2)/A = 0, as long as 
/3* ^ n^'^ logn = 0(A„). In other words, the results of Theorems 3.1 and 3.2 are applicable 
to the penalized estimator (8) with data driven weights. 

Condition 3 The regularization parameter A„, ^ ^-i/2+(Qo-2ai)+/2+a2 ^ where parameter 
ai is defined in Condition 5 and 02 S [0, 1/2) is a constant, bounded by the restriction in 
Condition 4- 

We use the following notation throughout the proof. Let B be a matrix. Denote by 
Aniin(B) and Amax(B) the minimum and maximum eigenvalue of the matrix B when it is 
a square symmetric matrix. Let ||B|| = Amax(B"^B) be the operator norm and ||B||oo the 
largest absolute value of the elements in B. As a result, || • || is the Euclidean norm when 
applied to a vector. Define ||B||2,oo = max||v||2=i ||Bv||oo- 
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Condition 4 The matrix S S satisfies Ci?i < Amin(S S) < Amax(S S) < for some 
positive constants Ci,C2- There exists > such that 

n 

j:(||S.||/nV2)(2+O^0, 
1=1 

where Sf is the i-th row of S. Furthermore, assume that the design matrix satisfies 
||X||oo = O(ni/2-("0-2ai)+/2-a2) and maxj^^^. ||X*||2 = 0{n), where X* is the j-th col- 
umn of X. 

Condition 5 Assume that 

sup \\Qdiag{d'iPM}Sh,oo = 0(ni-"i). 

/3eB{/3*,/3*) 

max A-J„ (S^(i^a5{a^w(/3)}S) = Op(n-i), 

^GB(/3*,/3*) 

where B{(3l,f3^) is an s-dimensional hall centered at PI with radius /3* and diag{dtl)^{l3)) 
is the diagonal matrix with i-th element equal to dil)^{Yi — SfP)- 

Appendix B: Lemmas 

Recall that X = (S, Q) and A^* = {1, • • • , s} is the true model. 

Lemma 6.1 Under Conditions 2 and 4, the penalized quasi-likelihood Ln{(3) defined by 
(9) has a unique global minimizer /3 = {/3i ,0"^)^, if 

n 

V'w (Yi - Xf ^) Si + nXndM. o sgn0,) = 0, (B.l) 

i=l 

||z0)|U <nA„, (B.2) 

where z((3) = dj^^ ° X^ILi (j^i ~ -^I^^ Qj' ^M, o-^d dj^c stand for the subvectors of 
d, consisting of its first s elements and the last p — s elements respectively, and sgn and 
o (the Hadamard product) in (B.l) are taken coordinatewise. Conversely, if (3 is a global 
minimizer of Ln{(3), then (B.l) holds and (B.2) holds with strict inequality replaced with 
non-strict one. 
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Proof of Lemma 6.1: Under conditions 2 and 4, Ln{P) is strictly convex. Necessary 
conditions (B.l) and (B.2) are direct consequences of the Karush-Kuhn- Tucker conditions 
of optimality. The sufficient condition follows from similar arguments as those in the proof 
of Theorem 1 in Fan and Lv (2010) and the strict convexity of the function -L(/3). 

Lemma 6.2 Under Conditions 1-5 we have that 

P°-ni2 = Op(v^+A„||do||), 
where do is the subvector of d, consisting of its first s elements. 

Proof of Lemma 6.2: Since ^2 — f^2 — 0> only need to consider the sub-vector of the 
first s components. Let us first show the existence of the biased oracle estimator. We can 
restrict our attention to the s-dimensional subspace {(3 ^ W : (3j^c = 0}. Our aim is to 
show that 

P ( inf L„ iPl + 7,u, 0) > LniP*)) ^ 1, (B.3) 

for sufficiently large ^n- Here, there is a minimizer inside the ball — < "Yn, with 
probability tending to one. Using the strict convexity of Ln{P), this minimizer is the unique 
global minimizer. 

By the Taylor expansion at 7^ = 0, we have 

Ln iPl + 7nU, 0) - LniPl, 0) = Ti + Ta, 

where 

n n 

Ti = -7„ J^V'w(e^)Sfu + -72^9V^w(e^-7nS^u)(Sfu)2 

i=l 1=1 

= -h + h 

s 

T2 = n\nY,dj{\P*+^nUj\-\P*\). 

where 7^ S [0,7n]- By the Cauchy-Schwarz inequality, 

\T2\ < n7„A„||do||||u|| = n7„A„||do||. 
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Note that for all ||u|| = 1, we have 



l^ll < 7n|| y^^w(£i)Si 



i=l 

and 

n / n \ 1/2 

1=1 \ i=l / 

which is of order 0{y/ns) by Condition 4. Hence, Ii = Op{'~fnVns) uniformly in u. 

Finally, we deal with I2. Let Hi{c) = inf j^j<c{(9V'w(ei — v)}. By Lemma 3.1 of Portnoy 
(1984), we have 



h > 7^X^^^(7n|Sfu|)(Sfu) 



i=l 

> Clin, 

for a positive constant c. Combining all of the above results, we have with probability 
tending to one that 

Ln{/3l + jnn,0)-Ln{f3l,0) > n7„{c7„ - Op(v/^) - A„||do||}, 

where the right hand side is larger than when 7„ = B{y^s/n + A„||do||) for a sufficiently 
large B > 0. Since the objective function is strictly convex, there exists a unique minimizer 
Pi such that 

0l-f3l\\=Op{y^ + Xn\\do\\). 



Lemma 6.3 Under the conditions of Theorem 3.2, 

n 

[b^A„b]-V2 V'w(e^)b^S, ^ AA(0, 1) (B.4) 
1=1 

where A„ = ^V'w(e)S'^S . 
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Proof of Lemma 6.3: By Condition 1, since Si is independent of ip^vi^i)-, we have 
Eijj^{ei)Si = 0, and 



Var 



i=l 



(B.5) 



To complete proof of the lemma, we only need to check the Lyapounov condition. By 
Condition 1, £J|^vv(£)P^^ < oo- Furthermore, Condition 4 implies 



b'^A„b = E^'^{e)h' SS' h > cm. 



TaaTi 



for a positive constant ci. Using these together with the Cauchy-Schwartz inequality, we 
have 



J^i?|[b^A„b]-i/Vw(ei)b^S, 

2+5 



i=l 



2+e 



1=1 



0(l)^|n-i/2||S. 



i=l 



2+e 



which tends to zero by Condition 4. This completes the proof. 

The following Bernstein's inequality can be found in Lemma 2.2.11 of 
der Vaart and Wellner (1996). 

Lemma 6.4 Let Yi, ■ ■ ■ ,Yn be independent random variables with zero mean such that 
E\Yi\^ < m\M^~'^Vi/2, for every m>2 (and all i) and some constants M and vi. Then 



P(ln + -- + l'„l>t)<2exp{-^^_^}. 



for V > vi + ■ ■ ■ Vn- 



Then the following inequality (B.6) is a consequence of previous Bernstein's inequality. 
Let {Yi} satisfy the condition of Lemma 6.4 with Vi = 1. For a given sequence {ai}, 
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E\aiYi\"^ < ml\aiM\"^ /2- A direct application of Lemma 6.4 yields 



P{\aiYi + --- + anYn\ > t) < 2exp{- 



2(Er=ia- +Mmaxi|ai|t) 



}• (B.6) 



Appendix C: Proofs of Theorems 



Proof of Theorem 3.1: We only need to show that /9 is the unique minimizer of L{P) 
in MP on a set which has a probability tending to one. Since Pi already satisfies (B.l), 
we only need to check (B.2). 

We now define the set r2„,. Let 



^ = (6 , • • • , Cp)"" = 5^ V'w (l^ - Xf /3*) X, 



i=l 



and consider the event Or 



'Mi 



< Uny/n^ with Un being chosen later. Then, by 



Condition 1 and Bernstein's inequality, it follows directly from (B.6) that 



P{\^j\ >t} < 2exp 



2(||X*||2i? + tif||X*||oo) 
where X^ is the j-th column of X. Taking t = Un\^, we have 



Pjl^jl > n„,\/n} < 2exp 



2(i?||X*||2/n + irn„||X*|U/^/^) 



(C.l) 



for some positive constant c > 0, by Condition 4. Thus, by using the union bound, we 
conclude that 

We now check whether (B.l) holds on the set Let 'ip^{f3) be the n-dimensional 
vector with the i-th element V'w(^ — Xf/3). Then, by Condition 2 



< 



O ( n^l'^un + 



+ 



dXjcoQ^[^^0'')-^^(/3*)] 



Q^diag(9^^(v))S(/3i-/31) 



(C.2) 
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where v lies between /3 and /3J. By Condition 5, the second term in (C.2) is bounded by 

O{n'-''^)0l - (3l\\ = OpK-°H\A7^+ Anildoll)}, 

where the equahty follows from Lemma 6.2. By the choice of parameters, 

(nAO-1z0°)|U = 0{n-i/2A-i(n„ + n'^-o-2a,)/2-^ ^ Z)„n"»/2-"i} = ^(l), 

by taking n„ = 72("o-2ai)+/2+02_ jjence, by Lemma 6.1, /3 is the unique global minimizer. 

Proof of Theorem 3.2: By Theorem 3.1, = j3i almost surely. It follows from 
Lemma 6.2 that 

WKl -I3*l\\= Op{^s{\nDn + l/V^)}. 

This establishes the first part of the Theorem. 

Let Qn{(3i) = ^^^i i^vfiXi — Sff3i)Si. By Taylor's expansion at the point PI, we have 

where v lies between the points j3^i and f3\ and 

n 

9Qn(v) = -Y. - Sf v)SiSf . (C.3) 

i=l 

By Lemma 6.2, ||v - < - PIW = op(l). 

By using (B.2), we have 

<5n(^wi) + nXndQ o sgn(/3^i) = 0, 

or equivalently, 

^wi -P\ = -dQn{^)-^Qn{(3*i) - aQ„(v)-VA„do o sgn^^i). (C.4) 
Note that ||do o sgn(/3^;^)|| = ||do||. We have for any vector u, 

u^5Q„(v)"Mo o sgn(^^i) < ||aQ„(v)-i|| • ||u|| • ||do||. 
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Consequently, for any unit vector b, 



< xUl{S^S)X^JdQnM)V~sDr, 



by using Conditions 4 and 5. This shows that the second term in (C.4), when multiphed 
by the vector b^(S'^S)^/2 is of order 

Op{-s/sn\nDn) = Op(l), 

by Condition 2. Therefore, we need to estabhsh the asymptotic normahty of the first term 
in (C.4). This term is identical to the situation dealt by Portnoy (1985). Using his result, 
the second conclusion of Theorem 3.2 follows. This completes the proof. 

Proof of Theorem 3.3: First of all, by Taylor expansion. 



$n,w(/3l) = ^nAPl) + f^n,w(/9l)(/3l " l^l), 

where Pi lies between I3\ and Consequently, 

\\Pi-h\\<Wi-h\\ = op{i). 

By the definition of the one step estimator (14) and (C.5), we have 

^PZi -P\ = f^n,w(/9l)-'^n,w(/3D + Rn, 

where 

We first deal with the remainder term. Note that 



(C.5) 



(C.6) 



|Rn|| < 



and 



T 

i ) 



(C.7) 



(C.^ 



i=l 
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where fi0i,l3i) = d'4){Yi - SffS^) - d^{Yi - Sf/3i). By the Liptchiz continuity, we have 

where C is the Liptchiz coefficient of dip-w{-). Let be the identity matrix of order s and 
bn = Amax{I]r=i ll^illSiSf }. By (C.8), we have 

n 

1=1 

Hence, ah of the eigenvalues of the matrix is no larger than C||/3i — ^^Hfen.. Similarly, by 
(C.8), 

n 

1=1 

and all of its eigenvalue should be at least — C||/3;^ — /9i||6n- Consequently, 

^n,M) - ^n,UPl)\\ < C01 - Mbn- 

By Condition 5 and the assumption of (3i, it follows from (C.7) that 

llRnll = Op{s/n ■ bn/n) = Op(s3/2/n). 

Thus, for any unit vector b, 

b^(S^S)V2R„ < aV2js^S)||R„|| = Op(s3/VnV2) = op(l). 

The main term in (C.6) can be handled by using Lemma 6.3 and the same method as 
Portnoy (1985). This completes the proof. 
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